UFO Sighting Dataset

Looking for datasets to explore, I came across a dataset on Kaggle listing UFO sightings over the last century. I performed some basic cleaning, EDA and visualization on it. Here goes my attempt to present it!

Data Cleaning

Importing the basic stuff and creating the dataframe

In [196]:
import pandas as pd 
import numpy as np 
import plotly.offline as pyo



df = pd.read_csv(r"../data/complete.csv",error_bad_lines=False,warn_bad_lines=False,low_memory=False) #note: on pandas >= 1.3 the two bad-lines flags are replaced by on_bad_lines='skip'

Let us take a look at the dataframe columns:

In [197]:
df.columns
Out[197]:
Index(['datetime', 'city', 'state', 'country', 'shape', 'duration (seconds)',
       'duration (hours/min)', 'comments', 'date posted', 'latitude',
       'longitude'],
      dtype='object')

11 variables, of which all but latitude and longitude are categorical; even lat/long have little numerical significance here (and are in fact stored as strings).

Taking a look at 5 samples from the dataframe.

In [198]:
df.sample(5)
Out[198]:
datetime city state country shape duration (seconds) duration (hours/min) comments date posted latitude longitude
18297 11/9/2011 04:30 albany or us circle 30 30 seconds Two bright orbs dropped from the sky and glide... 12/12/2011 44.6366667 -123.104722
73620 8/15/2001 16:00 newark oh us triangle 90 90sec first one was a bank of multi colored lights-f... 7/16/2003 40.0580556 -82.401389
57027 6/24/2001 10:00 highland mi us triangle 900 15 minutes Me and a friend saw a ufo 6/4/2004 42.6380556 -83.617222
60067 6/7/2008 21:00 mountain view ca us triangle 360 6 minutes Mysterious triangular objects over Mountain View 7/5/2008 37.3861111 -122.082778
21073 1/2/1999 17:42 seattle wa us light 4 4 seconds (observed) Walking around Greenlake on paved trail. Saw b... 5/24/1999 47.6063889 -122.330833

All the columns are pretty simple to understand. Let's remove the 'duration (hours/min)' column: its values are vague (e.g. '2 nights', '1-2 hours'), and the seconds column carries the same information.

In [199]:
df.drop(labels='duration (hours/min)',axis=1,inplace=True)

Taking a count of null values:

In [200]:
#let's take a count
df.isnull().sum()
Out[200]:
datetime                  0
city                      0
state                  7409
country               12365
shape                  2922
duration (seconds)        2
comments                 35
date posted               0
latitude                  0
longitude                 0
dtype: int64

State and Country have the maximum null values:

Let us try to fill these in by the following method. First, look at 'country' where 'state' == 'tx':

In [201]:
df[df['state']=='tx']['country']
Out[201]:
0         us
1        NaN
3         us
26        us
33        us
        ... 
88565     us
88576     us
88604     us
88631     us
88660     us
Name: country, Length: 4041, dtype: object

Let's check the number of null values here:

In [202]:
df[df['state']=='tx']['country'].isnull().sum()
Out[202]:
299

The mode here is very clearly 'us', yet we have 299 null values, so we can replace the null values with 'us'.

Why not extend this principle and:

Iterate through each unique city and fill up empty country values

Iterate through each city and fill up empty state values

Iterate through each state and fill up empty country values
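
The same mode-based filling can also be done without an explicit Python loop, using groupby/transform. A minimal sketch on a toy frame (the city and country values here are made up for illustration):

```python
import pandas as pd

# Toy frame mimicking the sightings data: one 'country' value is missing.
toy = pd.DataFrame({
    "city":    ["houston", "houston", "houston", "calgary"],
    "country": ["us",      None,      "us",      "ca"],
})

def fill_from_mode(frame, missing_col, using_col):
    """Fill NaNs in missing_col with the mode of missing_col within each using_col group."""
    frame[missing_col] = frame.groupby(using_col)[missing_col].transform(
        lambda s: s if s.mode().empty else s.fillna(s.mode().iloc[0])
    )
    return frame

filled = fill_from_mode(toy, "country", "city")
```

groupby.transform keeps the original row order and length, so the result drops straight back into the column.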

Let's check the number of unique cities:

In [203]:
df['city'].describe()
#there are ~22k unique cities and these can be used to fill many NaN country/state values 
Out[203]:
count       88679
unique      22018
top       seattle
freq          570
Name: city, dtype: object
In [204]:
city_list=list(df['city'].unique()) #Creating a list of all unique cities:

Let's define a function to perform the tasks mentioned above:

Fill empty __ using __; for example, fill empty countries using city.

In [205]:
#let's define a function to perform the tasks mentioned above - fill empty __ using __ 


def fill_missing(fill_missing,fill_using): #both column names as strings - eg: fill_missing = 'country', fill_using = 'city'
    unique_list = list(df[fill_using].unique()) #eg: list of unique cities
    for elem in unique_list: #for each city in the list
        corresponding_missing_series = df[df[fill_using]==elem][fill_missing].mode() #eg: the corresponding country, as a series
        if len(corresponding_missing_series) != 0: #if a mode exists (note the parentheses: len(series) != 0, not len(series != 0))
            corresponding_missing = corresponding_missing_series.iloc[0]
            df.loc[df[fill_using]==elem,fill_missing] = df.loc[df[fill_using]==elem,fill_missing].fillna(corresponding_missing) #replace all empty values with the mode
            
            

Filling up empty countries using city:

In [206]:
fill_missing('country','city')

Since what the function is doing might not be obvious, here is the same task written out without the function:

Filling up empty countries using city:

In [207]:
# #now iterating for each city
# for city in city_list: # city_list is a list of all unique cities
#     corresponding_country_series=df[df['city']==city]['country'].mode() #finding the mode country for that city
#     if len(corresponding_country_series) != 0: #if a mode exists
#         corresponding_country=corresponding_country_series.iloc[0]
#         df.loc[df['city']==city,'country'] = df.loc[df['city']==city,'country'].fillna(corresponding_country) #replace all empty values with the mode
        

Let's see how many null values we filled:

In [208]:
df['country'].isnull().sum()
Out[208]:
11425

Managed to fill ~1k empty fields in country.

Filling up empty states using city

In [209]:
fill_missing('state','city')

What the function actually did:

In [142]:
# #we already have city_list
# for city in city_list:
#     corresponding_state_series=df[df['city']==city]['state'].mode() #finding the mode state for that city
#     if len(corresponding_state_series) != 0: #if a mode exists
#         corresponding_state=corresponding_state_series.iloc[0]
#         df.loc[df['city']==city,'state'] = df.loc[df['city']==city,'state'].fillna(corresponding_state)

Let's see how many empty states we filled up:

In [210]:
df['state'].isnull().sum()
Out[210]:
7001

Only ~400 empty states filled.

Let's try to fill some more country fields by repeating the same code with states:

Filling up empty country using states

In [211]:
fill_missing('country','state')

Again, illustrating what our function did in the form of code just for clarity:

In [212]:
state_list=list(df['state'].unique())
#getting a list of all states

Filling up empty country using states

In [213]:
# #now iterating for each state
# for state in state_list:
#     corresponding_country_series=df[df['state']==state]['country'].mode() #finding the mode country for that state
#     if len(corresponding_country_series) != 0: #if a mode exists
#         corresponding_country=corresponding_country_series.iloc[0]
#         df.loc[df['state']==state,'country'] = df.loc[df['state']==state,'country'].fillna(corresponding_country)
In [214]:
df['country'].isnull().sum()
Out[214]:
4381

~7k null values filled up in country!

Now checking empty df values:

In [215]:
 
df.isnull().sum()
Out[215]:
datetime                 0
city                     0
state                 7001
country               4381
shape                 2922
duration (seconds)       2
comments                35
date posted              0
latitude                 0
longitude                0
dtype: int64

Let's remove rows with empty countries and empty states now.

In [216]:
df.dropna(subset=['country'],inplace=True)
df.dropna(subset=['state'],inplace=True)
In [217]:
df.isnull().sum()
Out[217]:
datetime                 0
city                     0
state                    0
country                  0
shape                 2686
duration (seconds)       1
comments                34
date posted              0
latitude                 0
longitude                0
dtype: int64

Finally, we parse the 'datetime' and 'date posted' columns into datetime64 format.

In [218]:
df.dropna(subset=['duration (seconds)'],inplace=True) #dropping the one empty row remaining
In [219]:
#let's parse the datetime column into datetime64 format 
df['datetime'] = df['datetime'].str.replace("24:00","00:00") #some rows had the time as 24:00 - changing to 00:00
df['datetime_parsed'] = pd.to_datetime(df['datetime'],format ="%m/%d/%Y %H:%M" )
In [220]:
#similarly parsing date posted 
df['date_posted_parsed'] = pd.to_datetime(df['date posted'],infer_datetime_format=True)
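
As a self-contained illustration of the parsing steps above (the timestamps here are hypothetical, in the dataset's month/day/year layout):

```python
import pandas as pd

# "24:00" is not a valid hour for pandas, so it is rewritten to "00:00"
# before parsing. Note this keeps the same date, whereas 24:00 strictly
# means midnight of the following day.
raw = pd.Series(["10/10/1949 20:30", "6/1/1960 24:00"])
cleaned = raw.str.replace("24:00", "00:00", regex=False)
parsed = pd.to_datetime(cleaned, format="%m/%d/%Y %H:%M")
```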

Converting 'duration (seconds)' to float32 type:

In [221]:
df['duration (seconds)'] = df['duration (seconds)'].str.replace("`","") #lot of values had ` attached to them.
df['duration (seconds)'] = df['duration (seconds)'].astype(dtype='float32')
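
A compact, self-contained version of this cleanup; pd.to_numeric with errors='coerce' is an alternative to astype that turns any remaining unparseable strings into NaN instead of raising (the sample values are made up):

```python
import pandas as pd

# Strip the stray backtick some values carry (e.g. "2`"), then coerce to float.
raw = pd.Series(["30", "2`", "900"])
clean = pd.to_numeric(raw.str.replace("`", "", regex=False), errors="coerce").astype("float32")
```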

Data cleaning done!

EDA:

Framing a broad set of questions:

1. Which geographic regions (city/state/country-wise) have the maximum UFO sightings?

2. Is there any truth to the phrase that aliens only want to invade the US (:P)?

3. How can we use the 'comments' column to get some meaningful insight?

4. Any insights from the shapes of the sightings?

Framing smaller chunks of questions

1. Which city has the maximum UFO sightings?

2. Segregate the countries according to the sightings count

3. Out of the cities where sightings have occurred, where have the fewest occurred? (say, for example, only 1)

4. Show the cities with a high volume of sightings (more than 50)

5. Out of the cities with sightings, how many are in the USA?

6. How many distinct shapes of UFOs have been sighted?

7. Most common shape of UFO sighted?

8. Most unusual shape of UFO sighted?

9. Average time duration for which UFOs are seen?

10. Group shapes by their mean duration

xx----xx

1. Which city has the maximum UFO sightings?

Top 20 cities according to sightings frequency:

(Seattle tops the list)

Interestingly, all 20 are US cities.

In [222]:
df['city'].value_counts()[:20]
Out[222]:
seattle          570
phoenix          488
portland         399
las vegas        395
los angeles      375
san diego        361
houston          317
chicago          295
tucson           258
miami            258
orlando          237
austin           229
albuquerque      229
springfield      228
sacramento       216
columbus         213
san francisco    204
san jose         197
denver           197
san antonio      193
Name: city, dtype: int64

2. Segregate the countries according to the sightings count

Only 4 countries, but the US does top the list.

In [223]:
df['country'].value_counts()
Out[223]:
us    77558
ca     3981
au       73
gb       64
Name: country, dtype: int64

3. Cities with the least volume of sightings - only 1

Out of ~17k unique cities, ~10k have only one sighting, i.e. more than 50 percent.

In [224]:
df_city_count = df['city'].value_counts().reset_index().rename(columns={'index':'city','city':'count'}) # df with count of cities 
df_city_count[df_city_count['count']==1][['city','count']]
Out[224]:
city count
7112 cranbury/east windsor 1
7113 big bay 1
7114 odd 1
7115 hopland 1
7116 somerset? 1
... ... ...
17888 lloyd harbor 1
17889 toulon 1
17890 sandpoint (ne of) 1
17891 jackson/west bend 1
17892 mississippi (rural) 1

10781 rows × 2 columns
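
The "more than 50 percent" figure can be read straight off the value counts. A minimal sketch on toy data (a few city names borrowed from the tables above):

```python
import pandas as pd

cities = pd.Series(["seattle"] * 3 + ["odd", "hopland", "toulon"])
counts = cities.value_counts()        # sightings per city
single_share = (counts == 1).mean()   # fraction of unique cities seen exactly once
```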

4. Cities with a high volume of sightings (at least 50)

Only 225 cities have at least 50 sightings, i.e. less than 2 percent.

In [225]:
df_city_count[df_city_count['count']>=50][['city','count']]
Out[225]:
city count
0 seattle 570
1 phoenix 488
2 portland 399
3 las vegas 395
4 los angeles 375
... ... ...
220 kennewick 50
221 wenatchee 50
222 syracuse 50
223 marietta 50
224 erie 50

225 rows × 2 columns

5. Out of the cities of sightings, how many are in the USA?

Out of ~17k unique cities, ~16k are in the US, an overwhelming majority.

In [226]:
df[df['country']=='us']['city'].value_counts()
Out[226]:
seattle                                        570
phoenix                                        488
portland                                       399
las vegas                                      395
los angeles                                    375
                                              ... 
pasco (walla walla/franklin county line)         1
hillsbro                                         1
chihuahua mts (north of&#44 mexican border)      1
marin/san rafael                                 1
mojave (21 mi. nw of)                            1
Name: city, Length: 16564, dtype: int64

6. How many distinct shapes of UFOs have been sighted?

29! Not that many.

In [227]:
df['shape'].describe()
Out[227]:
count     78990
unique       29
top       light
freq      16556
Name: shape, dtype: object

7. Most common shape of UFO sighted?

Light (?)

8. Most unusual shape of UFO sighted?

Pyramid, flare, changed (?), hexagon, dome - each sighted exactly once.

In [228]:
df['shape'].value_counts()
Out[228]:
light        16556
triangle      7956
circle        7655
fireball      6126
unknown       5882
other         5703
disk          5427
sphere        5282
oval          3767
formation     2448
cigar         2044
changing      1980
flash         1367
rectangle     1310
cylinder      1283
diamond       1185
chevron        962
egg            746
teardrop       732
cone           321
cross          241
delta            8
round            2
crescent         2
hexagon          1
dome             1
changed          1
pyramid          1
flare            1
Name: shape, dtype: int64

9. Average time duration for which UFOs are seen?

10. Group shapes by their mean duration

The US has almost all shape types sighted; its (red) bars predominate the plot.

Mean durations are under 1000 minutes for almost every shape-country pair.

Canada shows an overwhelmingly large mean for the 'other' shape, probably skewed by a few very large values. (If we remove the largest 'other' duration in Canada - 1,380,000 minutes - the mean drops to only 287.)

In [229]:
#adding another column - converting seconds to minutes
df['duration (minutes)'] = df['duration (seconds)']/60
#pivot table
pd.pivot_table(df,index=['shape'],columns=['country'],values=['duration (minutes)'],aggfunc='mean').plot.bar(figsize=(20, 10))
Out[229]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe7a97958d0>
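
For clarity, here is a toy version of the pivot used above, with made-up shapes, countries and durations:

```python
import pandas as pd

toy = pd.DataFrame({
    "shape":   ["light", "light", "disk"],
    "country": ["us",    "ca",    "us"],
    "duration (minutes)": [10.0, 20.0, 30.0],
})
# Mean duration per shape (rows) and country (columns);
# missing shape/country combinations become NaN.
pivot = pd.pivot_table(toy, index="shape", columns="country",
                       values="duration (minutes)", aggfunc="mean")
```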

Without country-wise segregation:

Crescent-shaped UFOs are seen for the highest mean duration, and dome-shaped for the least.

In [230]:
df[['shape','duration (minutes)']].groupby('shape').mean().sort_values(by='duration (minutes)', ascending = False)
Out[230]:
duration (minutes)
shape
crescent 315.083344
other 299.817444
light 210.869003
diamond 111.433624
flash 85.636543
unknown 83.921013
circle 77.377106
oval 61.707512
fireball 60.275959
changed 60.000000
cylinder 56.450382
delta 33.656250
disk 32.963181
changing 32.560684
cigar 31.628613
egg 31.190004
sphere 26.318829
cone 23.557581
triangle 21.986374
formation 20.414211
teardrop 15.165858
rectangle 10.980757
cross 10.897994
chevron 8.373873
round 7.541667
hexagon 4.000000
pyramid 2.000000
flare 0.500000
dome 0.033333
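
Since means like these are easily dragged up by a single extreme report (as with Canada's 'other' category earlier), the median is a more robust alternative. A toy sketch with one huge outlier:

```python
import pandas as pd

toy = pd.DataFrame({
    "shape": ["other"] * 4,
    "duration (minutes)": [5.0, 10.0, 15.0, 1_380_000.0],  # one extreme outlier
})
grouped = toy.groupby("shape")["duration (minutes)"]
mean_d = grouped.mean().iloc[0]      # blown up by the outlier
median_d = grouped.median().iloc[0]  # barely affected
```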

Visualization:

In [231]:
import plotly.express as px 
import plotly.graph_objects as go 

Let's see the distribution of the duration for which UFOs are sighted, country-wise, as a pie chart:

Note that px.pie sums the values per slice, so this shows total (not mean) duration. The total is overwhelmingly concentrated in the US and Canada, with only tiny slices for Britain and Australia.

In [249]:
px.pie(df,names ='country',values='duration (minutes)',title='Duration (total) of UFO sightings by country:')

Let's see if the distribution of number of sightings is also similar.

As seen below, the number of sightings is even more one-sided, with ~95 percent of all sightings in the US.

In [233]:
df_country_count = df['country'].value_counts().reset_index().rename(columns={'index':'country','country':'value'})
df_country_count
Out[233]:
country value
0 us 77558
1 ca 3981
2 au 73
3 gb 64
In [234]:
px.pie(df_country_count,names='country',values='value',title='Number of UFO sightings by country:')
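
The ~95 percent figure follows directly from the counts table:

```python
# Counts copied from df_country_count above.
counts = {"us": 77558, "ca": 3981, "au": 73, "gb": 64}
us_share = counts["us"] / sum(counts.values())  # fraction of all sightings in the US
```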

--XX--

Let's visualize the top 20 cities in terms of mean duration as a bar chart.

The text on top of each bar shows the country. Only one city in the top 20 is from Canada.

In [235]:
df_mean_duration = df[['country','city','duration (minutes)']].groupby(['city','country']).mean().sort_values(by = 'duration (minutes)',ascending=False).reset_index()
px.bar(df_mean_duration[:20],x='city',y='duration (minutes)',text='country',title='Top 20 cities for duration(mean) of UFO sightings:')

Let's check the mean duration separately for the top 20 cities of the other 3 countries to see how they compare:

Starting off with Canada

In [236]:
df_ca=df[df['country']=='ca']
df_ca_mean_duration = df_ca[['city','duration (minutes)']].groupby(['city']).mean().sort_values(by = 'duration (minutes)',ascending=False).reset_index()
px.bar(df_ca_mean_duration[:20],x='city',y='duration (minutes)',text='duration (minutes)',title='Top 20 Canadian cities for duration(mean) of UFO sightings')

To Australia

In [237]:
df_au=df[df['country']=='au']
df_au_mean_duration = df_au[['city','duration (minutes)']].groupby(['city']).mean().sort_values(by = 'duration (minutes)',ascending=False).reset_index()
px.bar(df_au_mean_duration[:20],x='city',y='duration (minutes)',text = 'duration (minutes)',title='Top 20 Australian cities for duration(mean) of UFO sightings')

To Great Britain:

In [238]:
df_gb=df[df['country']=='gb']
df_gb_mean_duration = df_gb[['city','duration (minutes)']].groupby(['city']).mean().sort_values(by = 'duration (minutes)',ascending=False).reset_index()
px.bar(df_gb_mean_duration[:20],x='city',y='duration (minutes)',text = 'duration (minutes)',title='Top 20 British cities for duration(mean) of UFO sightings')

From the above 4 graphs, we can quickly figure out that:

Great Britain and Australia are on a similar scale for duration of sightings,

UFOs in Canada are sighted for around 100x longer durations than in Great Britain and Australia,

and similarly, UFOs in the USA are sighted for around 100x longer durations than in Canada!
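
The 100x comparisons are ratios of per-country mean durations; the computation looks like this (with made-up numbers chosen to give a clean ratio, not the dataset's actual values):

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["us", "us", "ca", "ca"],
    "duration (minutes)": [1000.0, 3000.0, 10.0, 30.0],
})
mean_by_country = toy.groupby("country")["duration (minutes)"].mean()
ratio_us_ca = mean_by_country["us"] / mean_by_country["ca"]  # 2000 / 20
```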

----XX-------

Let's do a scatterplot of the shape counts, country by country:

In the US, 'light' is the most commonly sighted shape, with more than double the number of sightings of the next most common shape, 'triangle'.

In [239]:
df_us_shape_count = df[df['country']=='us']['shape'].value_counts().reset_index().rename(columns={'index':'shape','shape':'count'}) #df of shapes according to count
px.scatter(df_us_shape_count,x='shape',y='count',title='Distribution of shapes of UFO in USA by number of sightings')

Let's see if the distribution of shapes is similar for other countries as well.

Canada

In [240]:
df_ca_shape_count = df[df['country']=='ca']['shape'].value_counts().reset_index().rename(columns={'index':'shape','shape':'count'}) #df of shapes according to count
px.scatter(df_ca_shape_count,x='shape',y='count',title='Distribution of shapes of UFO in Canada by number of sightings')

Australia

In [241]:
df_au_shape_count = df[df['country']=='au']['shape'].value_counts().reset_index().rename(columns={'index':'shape','shape':'count'}) #df of shapes according to count
px.scatter(df_au_shape_count,x='shape',y='count',title='Distribution of shapes of UFO in Australia by number of sightings')

Great Britain

In [242]:
df_gb_shape_count = df[df['country']=='gb']['shape'].value_counts().reset_index().rename(columns={'index':'shape','shape':'count'}) #df of shapes according to count
px.scatter(df_gb_shape_count,x='shape',y='count',title='Distribution of shapes of UFO in Great Britain by number of sightings')

Looking at the above 4 plots, the difference in absolute numbers is apparent.

However, 'light' remains the most commonly sighted shape in all 4 countries. In both the USA and Canada, 'light' is sighted more than double the number of times of the second most sighted shape.

The gap is smaller in Britain and Australia, but there is also less data for these countries; a similar trend could be expected with more data.

In terms of common shapes, 'light', 'circle' and 'triangle' are the most commonly sighted, again something that holds true for all four countries.

---XX--

Choropleth for USA using the state codes given in the dataframe.

In [243]:
df['state_upper'] = df['state'].str.upper() #converting state codes to upper case
In [244]:
df['latitude'] = df['latitude'].str.replace('q','')
df['latitude']=df['latitude'].astype('float64')

#converting columns to float 64 and removing any miscellaneous strings present in the numbers
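
A self-contained version of the latitude cleanup (the 'q'-contaminated value here is hypothetical):

```python
import pandas as pd

lat = pd.Series(["47.6063889", "33q.2"])  # stray 'q' in one value
lat = lat.str.replace("q", "", regex=False).astype("float64")
```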
In [245]:
df_duration=df[['state_upper','duration (minutes)','latitude','longitude']].groupby(['state_upper']).mean().sort_values(by = 'duration (minutes)',ascending=False).reset_index()


#dataframe for mean duration of ufo sightings per states
In [246]:
px.choropleth(df_duration,color='duration (minutes)',locations='state_upper',locationmode='USA-states',color_continuous_scale='greens')


Scatterplot on map showing distribution of duration of sightings!

In [247]:
px.scatter_geo(df_duration,lat = 'latitude',lon='longitude',color='duration (minutes)')

--XX-- With only one numerical value to plot on the map, I couldn't come up with correlations, etc.
